Overview
We think the analogy to using R is clear:
- If you are anxious, stressed or avoidant you will be distracted
- Getting confident with the basics makes more complex techniques possible
TODO: replace with feelgood video
In this session we cover:
- Loading data from files
- Using simple techniques to answer research questions with data
- Saving intermediate steps using variables
Principles/ideas
- Using data to answer questions
- Precision and literal-mindedness of R
- Paths and directories
R techniques covered
- Storing data in variables
- Passing data to functions using the pipe %>%
- Loading data from elsewhere
- The Files pane
- Uploading data files
- Selecting rows with filter()
- Sorting data using arrange()
- Combining filter() with arrange()
- Summarising data using summarise()
- Grouping data with group_by()
- R comments
Storing data in variables
TODO: replace with video
Video summary:
- In R, a variable is the name for a container which stores data.
- We make variables using the assignment operator, which looks like this: <-.
- Values on the right hand side of <- are stored in the variable on the left hand side.
- Variables that you create are stored in the Global Environment, which you can see using the Environment pane.
# calculate 40 + 2 and assign the result to a variable
meaning_of_life <- 40 + 2
# print variable
meaning_of_life
[1] 42
As we work, it’s useful to be able to save the results of the code we write.
As one example, we might have a dataset with multiple columns, each holding participants’ answers to an individual questionnaire item. We might want to calculate a new column (maybe an average of each person’s scores on all of the questions) and keep track of this so we can use it in later calculations.
Alternatively, we might want to save the result of a specific calculation and use it later on.
To do this we can create a variable.
A variable is just a container to store data in. To make variables we use the assignment operator, which looks like this: <-
That is, like an arrow that points to the left. This is a reminder that the results of the calculation on the right hand side will be assigned (stored) in the variable on the left hand side.
The code in this chunk runs the calculation on the right hand side of the assignment operator, 40 + 2, and assigns the result to a new variable named meaning_of_life. The output of the chunk is 42, the value of meaning_of_life.
Give your variables short names which describe the data they contain. Use the underscore _ if you need to use more than one word e.g. meaning_of_life.
You might wonder where these variables get saved. In most cases, variables you create are stored in what’s called the Global Environment. You can see them in the Environment pane in RStudio. Double-clicking on any variable there will show you what is stored inside the container.
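As a minimal sketch of why saving results is useful, the chunk below stores a calculation in one variable and then reuses it in a second calculation. The item scores and the variable names are made up for illustration.

```r
# hypothetical answers from one participant to three questionnaire items
item_scores <- c(4, 2, 3)

# store the average so it can be reused later
average_score <- mean(item_scores)

# reuse the stored value in a new calculation
doubled_average <- average_score * 2

# print both variables
average_score
doubled_average
```

After running the chunk, all three variables should be listed in the Environment pane.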
Exercise 1
- Open session-2.rmd using the Files pane. This is the workbook you will be using in this session.
- Run the first chunk in the workbook.
The output should look like this:
Results of creating meaning_of_life variable
Your Environment pane should look like this:
Environment pane after creating variable
Exercise 2
- Create a level 3 markdown heading named “Exercise 2” in your workbook
- Create a new chunk beneath the heading
- Assign the results of the calculation 2 * 35 to the variable seventy
- Run the chunk
Your Environment should now look like this:
Environment pane after creating new variable
Exercise 3
- Create a level 3 markdown heading named “Exercise 3” in your workbook
- Use R to calculate your age in the year 2051.
- Save the result in a variable with a descriptive name.
Passing data to functions using the pipe %>%
TODO: replace with video
Video summary:
- We pass data from one piece of code to another using the pipe function, which looks like this: %>%.
- A pipeline is a sequence of two or more functions joined by %>%.
- You can use the assignment operator to store the results of a pipeline in a variable.
# pipe mtcars into head()
mtcars %>% head()
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# store first few rows of mtcars
mtcars_head <- mtcars %>%
head()
Sometimes we need to link together multiple steps in our analysis.
For example, if we’re working with a big dataset we might want to select only some of the columns, then filter out some of the rows of data, and then finally calculate descriptive statistics.
We could do this by creating lots of variables, each one saving the results at each intermediate step. This can get confusing, though.
Instead we can use what’s known as a ‘pipe’, which is another way to link together multiple instructions.
The pipe sends data from one piece of code to another.
The pipe looks like this %>%.
In session 1, you used this code to “pipe” the mtcars dataset into head(), which shows just the first few rows:
mtcars %>% head()
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
You can think of your data as flowing along lengths of pipe, joined by functions which do things to the data, step by step, until the result you want plops out at the end.
Each %>% should be read as the word “then”, e.g. “pipe mtcars data, then head() it”.
The > in the pipe function reminds you of the direction in which your data is flowing (it only works left to right).
It’s important to know that the pipe doesn’t store the results of these steps.
Sometimes that’s OK. In our first example we just wanted to look at the first few rows of the mtcars data.
But, you will usually want to save the result of a pipeline in a new variable.
For example, if we wanted to save the first few rows of the mtcars data to a new variable we would write:
mtcars_head <- mtcars %>% head()
Here we combine assignment with a pipeline.
The result of the pipeline (a data.frame containing the first few rows of mtcars) is saved to a new variable called mtcars_head.
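To illustrate what the pipe is doing, this sketch compares a pipeline with the equivalent ordinary function call. It assumes the dplyr package (part of the tidyverse) is installed; the variable names piped and nested are only for illustration.

```r
library(dplyr)  # makes the %>% pipe available

# send mtcars into head() with the pipe, and store the result
piped <- mtcars %>% head()

# the same thing written without the pipe
nested <- head(mtcars)

# check that both approaches give exactly the same data.frame
identical(piped, nested)
```

Both versions do the same job. The pipe version becomes easier to read once a pipeline has several steps.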
You can explore your variables using the Environment pane. A data.frame will have an icon that looks like a spreadsheet. If you click on the icon, the data.frame is displayed in a new tab in the Source pane.
This tab shows you the same information as printing the data.frame, such as the number of rows and columns, but it also provides tools for exploring the data interactively.
- The arrows next to the column names allow you to arrange the rows in ascending or descending order based on the column values.
- The Filter button allows you to specify a value for one or more columns to filter out non-matching rows. For example, we could display just cars with 4 gears. Click the button again to turn off the filter.
Exercise 4
- Create a level 3 markdown heading named “Exercise 4” in your workbook. (You should be used to doing this for every exercise by now, so we won’t remind you again.)
- Create a new chunk beneath the heading
- Load the tidyverse library
- Pipe the mpg data.frame into head() and assign the results to a variable called mpg_head
- Use the Environment pane to open mpg_head
In 1999, a 6 cylinder, manual transmission Audi A4 could cover ____ miles per gallon when driven in the city.
Loading data from elsewhere
TODO: replace with video
Video summary:
- Often we want to load data into R, rather than use built-in datasets.
- The preferred format for data files in R is comma-separated values (CSV).
- CSV data can be read using the read_csv() function.
- You can load data from an internet address (URL) or a file uploaded to the server.
Loading data
In a lot of these sessions we use datasets that are built-in to R because it’s quick and convenient to illustrate the points we make.
[demo opening glancing some built in data like gapminder, iris, mtcars etc]
Normally, though, you will need to load your own data.
R can read data from two places:
- A URL (web address), if the data file is available on the internet somewhere
- A file on the computer that R is running on
The link below is a URL (web address) for a file containing data about US police shootings.
The final part of the url tells us the name of the file: shootings.csv
The final 3 (sometimes 4) letters of the filename are called the file extension.
Here the file extension is .csv, which stands for ‘comma separated values’ or CSV.
CSV is a common data format. Most data-oriented programs (e.g. Excel or Open Office or SPSS) can read and write .csv files, so it’s a good choice for storing and sharing data.
If you click on the link [click link in vid] you’ll see the first line is a list of column names separated by commas.
The remaining lines contain rows of data matching the column headings. For example, the value of the arms_category column in row 1 is Guns.
The read_csv() function reads a CSV file, and converts it to a data.frame, which is the format we use in R.
We can use read_csv() to load data from either a file, or over the internet, which is shown in the next video.
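The following sketch shows the whole round trip on a tiny, made-up dataset: it writes three lines of text to a temporary CSV file and reads them back with read_csv(). The people data and the csv_path variable are invented for this example, and it assumes the readr package (part of the tidyverse) is installed.

```r
library(readr)

# create a temporary file name ending in .csv
csv_path <- tempfile(fileext = ".csv")

# first line holds the column names; remaining lines are rows of data
writeLines(c("name,age", "Ann,34", "Bob,27"), csv_path)

# read the file and store the resulting data.frame in a variable
people <- read_csv(csv_path)

# display the data
people
```

The column names in the result come from the first line of the file, and each later line becomes one row.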
Reading CSV files from the internet
TODO: replace with video
Video summary:
- read_csv('http://...') can load data from a URL.
- It converts the data to a data.frame.
- You must assign the loaded data to a variable, which you should give a descriptive name.
- Use the Environment pane to view data you load using read_csv().
# load data from a URL into a variable
shootings <- read_csv('https://benwhalley.github.io/lifesavR/data/shootings.csv')
# display data
shootings
# A tibble: 4,895 x 15
id name date manner_of_death armed age gender race city state
<dbl> <chr> <date> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 3 Tim E… 2015-01-02 shot gun 53 M Asian Shel… WA
2 4 Lewis… 2015-01-02 shot gun 47 M White Aloha OR
3 5 John … 2015-01-03 shot and Tasered unar… 23 M Hisp… Wich… KS
4 8 Matth… 2015-01-04 shot toy … 32 M White San … CA
5 9 Micha… 2015-01-04 shot nail… 39 M Hisp… Evans CO
6 11 Kenne… 2015-01-04 shot gun 18 M White Guth… OK
7 13 Kenne… 2015-01-05 shot gun 22 M Hisp… Chan… AZ
8 15 Brock… 2015-01-06 shot gun 35 M White Assa… KS
9 16 Autum… 2015-01-06 shot unar… 34 F White Burl… IA
10 17 Lesli… 2015-01-06 shot toy … 47 M Black Knox… PA
# … with 4,885 more rows, and 5 more variables: signs_of_mental_illness <lgl>,
# threat_level <chr>, flee <chr>, body_camera <lgl>, arms_category <chr>
CSV files are a common format to store and share data. As shown in the previous video, the first line of a CSV file defines the column names, and the remaining lines are rows of data.
The read_csv() function reads a CSV file, and converts it to a data.frame, which is the format we use in R. We can load data either from a file, or over the internet.
In this example, I’m reading a CSV directly over the Internet and storing the resulting data.frame in the variable shootings.
The URL (the link to the CSV file) needs to be in quotes (single or double quotes both work).
shootings <- read_csv('https://benwhalley.github.io/lifesavR/data/shootings.csv')
Because we made a new variable, the result is stored in the Environment, and we can double-click it to have a look at the data.
An alternative (and recommended) way is to simply type the name of the variable:
shootings
# A tibble: 4,895 x 15
id name date manner_of_death armed age gender race city state
<dbl> <chr> <date> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 3 Tim E… 2015-01-02 shot gun 53 M Asian Shel… WA
2 4 Lewis… 2015-01-02 shot gun 47 M White Aloha OR
3 5 John … 2015-01-03 shot and Tasered unar… 23 M Hisp… Wich… KS
4 8 Matth… 2015-01-04 shot toy … 32 M White San … CA
5 9 Micha… 2015-01-04 shot nail… 39 M Hisp… Evans CO
6 11 Kenne… 2015-01-04 shot gun 18 M White Guth… OK
7 13 Kenne… 2015-01-05 shot gun 22 M Hisp… Chan… AZ
8 15 Brock… 2015-01-06 shot gun 35 M White Assa… KS
9 16 Autum… 2015-01-06 shot unar… 34 F White Burl… IA
10 17 Lesli… 2015-01-06 shot toy … 47 M Black Knox… PA
# … with 4,885 more rows, and 5 more variables: signs_of_mental_illness <lgl>,
# threat_level <chr>, flee <chr>, body_camera <lgl>, arms_category <chr>
Exercise 5
- Create a new chunk.
- Read the data stored at https://benwhalley.github.io/lifesavR/data/shootings.csv
- View it using the Environment pane.
- View it using glimpse().
Using data from your computer
TODO: replace with video
Video summary:
- Before you can use data from your computer, you must upload it to the server.
- Data can be uploaded using the Files pane.
- Always upload data to the same location as your R code.
- For data you upload, give read_csv() the path to the CSV file.
- You must assign the loaded data to a variable, which you should give a descriptive name.
- Use the Environment pane to view the data.
The Upload button in the Files pane lets you upload a file from your computer to RStudio. RStudio uses file extensions to guess what the file contains. A file extension is a sequence of characters, starting with a . at the end of a file name.
- .csv - CSV file
- .rmd - R Markdown file
Make sure that any file you upload has the correct file extension.
We’ll upload shootings.csv from the previous exercise.
- Click the Upload button.
- Ensure the Target directory is where you want the uploaded file to appear. For this module it should read ~/lifesavr. The ~ (pronounced “tilde”) means your Home directory on the RStudio server. The /lifesavr means the folder named lifesavr in Home.
- Click the Choose file button and select the file you want to upload. After you select a file, its name appears next to the button.
- Click the OK button.
The file should appear in the Files pane in your lifesavr folder.
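Once a file has been uploaded next to your workbook, its path is just its file name. The sketch below uses a hypothetical file called survey.csv (not part of this course) and checks that the file exists before reading it, so the chunk runs without an error either way. It assumes the readr package is installed.

```r
library(readr)

# hypothetical file name; assumed to have been uploaded next to the workbook
data_file <- "survey.csv"

if (file.exists(data_file)) {
  # the file is in the working directory, so the file name is the whole path
  survey <- read_csv(data_file)
} else {
  # remind ourselves where R was looking
  message("Could not find ", data_file, " in ", getwd())
}
```

If R reports that it could not find the file, check the Files pane to see which folder the file was uploaded to.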
Exercise 6
- Use your web browser to download https://benwhalley.github.io/lifesavR/data/shootings.csv to your computer.
- Upload shootings.csv to the server.
- Create a new chunk.
- Read shootings.csv into a variable with a descriptive name.
In which city was the earliest recorded shooting?
Selecting rows with filter()
TODO: replace with video
Video summary:
- The filter() function selects rows from a dataset which match criteria we set.
- The simplest filter uses == (equals equals), to test if the row is an exact match.
- We can use other filters like < or > to match criteria in numeric columns.
- We can combine multiple filters to get exactly the rows we need.
# load gapminder dataset
library(gapminder)
# filter rows where country is equal to the word "Kenya"
# remember to use double equals (==) rather than single (=)
gapminder %>%
filter(country == "Kenya")
# A tibble: 12 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Kenya Africa 1952 42.3 6464046 854.
2 Kenya Africa 1957 44.7 7454779 944.
3 Kenya Africa 1962 47.9 8678557 897.
4 Kenya Africa 1967 50.7 10191512 1057.
5 Kenya Africa 1972 53.6 12044785 1222.
6 Kenya Africa 1977 56.2 14500404 1268.
7 Kenya Africa 1982 58.8 17661452 1348.
8 Kenya Africa 1987 59.3 21198082 1362.
9 Kenya Africa 1992 59.3 25020539 1342.
10 Kenya Africa 1997 54.4 28263827 1360.
11 Kenya Africa 2002 51.0 31386842 1288.
12 Kenya Africa 2007 54.1 35610177 1463.
# select rows where year is greater than 2000
gapminder %>%
filter(year > 2000)
# A tibble: 284 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 2002 42.1 25268405 727.
2 Afghanistan Asia 2007 43.8 31889923 975.
3 Albania Europe 2002 75.7 3508512 4604.
4 Albania Europe 2007 76.4 3600523 5937.
5 Algeria Africa 2002 71.0 31287142 5288.
6 Algeria Africa 2007 72.3 33333216 6223.
7 Angola Africa 2002 41.0 10866106 2773.
8 Angola Africa 2007 42.7 12420476 4797.
9 Argentina Americas 2002 74.3 38331121 8798.
10 Argentina Americas 2007 75.3 40301927 12779.
# … with 274 more rows
# select rows with low life expectancy
gapminder %>%
filter(lifeExp < 35)
# A tibble: 33 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Angola Africa 1952 30.0 4232095 3521.
6 Angola Africa 1957 32.0 4561361 3828.
7 Angola Africa 1962 34 4826015 4269.
8 Burkina Faso Africa 1952 32.0 4469979 543.
9 Burkina Faso Africa 1957 34.9 4713416 617.
10 Cambodia Asia 1977 31.2 6978607 525.
# … with 23 more rows
# combine multiple filters
gapminder::gapminder %>%
filter(country=="Kenya") %>%
filter(year > 2000) %>%
filter(lifeExp < 55)
# A tibble: 2 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Kenya Africa 2002 51.0 31386842 1288.
2 Kenya Africa 2007 54.1 35610177 1463.
The following chunk filters the gapminder dataset to include only rows where the country column equals “Kenya”.
library(gapminder)
gapminder %>% filter(country == "Kenya")
# A tibble: 12 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Kenya Africa 1952 42.3 6464046 854.
2 Kenya Africa 1957 44.7 7454779 944.
3 Kenya Africa 1962 47.9 8678557 897.
4 Kenya Africa 1967 50.7 10191512 1057.
5 Kenya Africa 1972 53.6 12044785 1222.
6 Kenya Africa 1977 56.2 14500404 1268.
7 Kenya Africa 1982 58.8 17661452 1348.
8 Kenya Africa 1987 59.3 21198082 1362.
9 Kenya Africa 1992 59.3 25020539 1342.
10 Kenya Africa 1997 54.4 28263827 1360.
11 Kenya Africa 2002 51.0 31386842 1288.
12 Kenya Africa 2007 54.1 35610177 1463.
The == is called an “operator”. It compares values from the column on the left hand side with the value specified on the right hand side. The value must match the column type. The value "Kenya" is in quotes because it is text: the country column is a factor, which stores text labels.
The “greater than” operator > filters numeric data.
gapminder %>% filter(year > 2000)
# A tibble: 284 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 2002 42.1 25268405 727.
2 Afghanistan Asia 2007 43.8 31889923 975.
3 Albania Europe 2002 75.7 3508512 4604.
4 Albania Europe 2007 76.4 3600523 5937.
5 Algeria Africa 2002 71.0 31287142 5288.
6 Algeria Africa 2007 72.3 33333216 6223.
7 Angola Africa 2002 41.0 10866106 2773.
8 Angola Africa 2007 42.7 12420476 4797.
9 Argentina Americas 2002 74.3 38331121 8798.
10 Argentina Americas 2007 75.3 40301927 12779.
# … with 274 more rows
This chunk selects rows where year is greater than 2000.
The < operator works the other way round: it selects rows where a numeric column is less than a value.
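As a quick sketch of these three operators, the chunk below applies each one to the built-in mtcars data. It assumes dplyr is loaded; the variable names are only for illustration.

```r
library(dplyr)

# == keeps rows that match a value exactly
four_gears <- mtcars %>% filter(gear == 4)

# > keeps rows where the column is greater than a value
# (wt is recorded in units of 1000 lbs)
heavy_cars <- mtcars %>% filter(wt > 5)

# < keeps rows where the column is less than a value
low_power <- mtcars %>% filter(hp < 70)

# count the rows kept by each filter
nrow(four_gears)
nrow(heavy_cars)
nrow(low_power)
```

Numbers are written without quotes, whereas text values such as "Kenya" need them.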
Combined filters
gapminder::gapminder %>%
filter(country=="Kenya") %>%
filter(year > 2000)
# A tibble: 2 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Kenya Africa 2002 51.0 31386842 1288.
2 Kenya Africa 2007 54.1 35610177 1463.
Exercise 7
Filter gapminder to show countries with a population greater than 100 million.
Your results should look like this:
# A tibble: 77 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Bangladesh Asia 1987 52.8 103764241 752.
2 Bangladesh Asia 1992 56.0 113704579 838.
3 Bangladesh Asia 1997 59.4 123315288 973.
4 Bangladesh Asia 2002 62.0 135656790 1136.
5 Bangladesh Asia 2007 64.1 150448339 1391.
6 Brazil Americas 1972 59.5 100840058 4986.
7 Brazil Americas 1977 61.5 114313951 6660.
8 Brazil Americas 1982 63.3 128962939 7031.
9 Brazil Americas 1987 65.2 142938076 7807.
10 Brazil Americas 1992 67.1 155975974 6950.
# … with 67 more rows
Exercise 8
Show countries with a population greater than 100 million and life expectancy greater than 70.
The results should look like this:
# A tibble: 27 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Brazil Americas 2002 71.0 179914212 8131.
2 Brazil Americas 2007 72.4 190010647 9066.
3 China Asia 1997 70.4 1230075000 2289.
4 China Asia 2002 72.0 1280400000 3119.
5 China Asia 2007 73.0 1318683096 4959.
6 Indonesia Asia 2007 70.6 223547000 3541.
7 Japan Asia 1967 71.4 100825279 9848.
8 Japan Asia 1972 73.4 107188273 14779.
9 Japan Asia 1977 75.4 113872473 16610.
10 Japan Asia 1982 77.1 118454974 19384.
# … with 17 more rows
Sorting data using arrange()
TODO: replace with video
Video summary:
- The arrange() function sorts rows in a dataset.
- Give arrange() a single column name to sort data in ascending order.
- To sort in descending order, put a - before the column name.
- Use commas between column names to sort by one column within another.
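The minus sign relies on the column being numeric. As a minimal sketch of an alternative, assuming dplyr is loaded: the desc() helper from dplyr also sorts in descending order, and for a numeric column it should give the same ordering as the minus sign. The variable names are only for illustration.

```r
library(dplyr)

# sort mtcars from highest to lowest mpg using a minus sign
with_minus <- mtcars %>% arrange(-mpg)

# the same sort written with desc()
with_desc <- mtcars %>% arrange(desc(mpg))

# compare the sorted mpg values from both approaches
identical(with_minus$mpg, with_desc$mpg)
```

Either style is fine for the numeric columns used in this session.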
# sort by carat in ascending order
diamonds %>%
arrange(carat) %>%
head(3)
# A tibble: 3 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.2 Premium E SI2 60.2 62 345 3.79 3.75 2.27
2 0.2 Premium E VS2 59.8 62 367 3.79 3.77 2.26
3 0.2 Premium E VS2 59 60 367 3.81 3.78 2.24
# sort by carat in descending order
diamonds %>%
arrange(-carat) %>%
head(3)
# A tibble: 3 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 5.01 Fair J I1 65.5 59 18018 10.7 10.5 6.98
2 4.5 Fair J I1 65.8 58 18531 10.2 10.2 6.72
3 4.13 Fair H I1 64.8 61 17329 10 9.85 6.43
# sort by price (descending) within carat (ascending)
diamonds %>%
arrange(carat, -price)
# A tibble: 53,940 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.2 Premium E VS2 59.8 62 367 3.79 3.77 2.26
2 0.2 Premium E VS2 59 60 367 3.81 3.78 2.24
3 0.2 Premium E VS2 61.1 59 367 3.81 3.78 2.32
4 0.2 Premium E VS2 59.7 62 367 3.84 3.8 2.28
5 0.2 Ideal E VS2 59.7 55 367 3.86 3.84 2.3
6 0.2 Premium F VS2 62.6 59 367 3.73 3.71 2.33
7 0.2 Ideal D VS2 61.5 57 367 3.81 3.77 2.33
8 0.2 Very Good E VS2 63.4 59 367 3.74 3.71 2.36
9 0.2 Ideal E VS2 62.2 57 367 3.76 3.73 2.33
10 0.2 Premium D VS2 62.3 60 367 3.73 3.68 2.31
# … with 53,930 more rows
Sort by carat in ascending order
diamonds %>%
arrange(carat) %>%
head(3)
# A tibble: 3 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.2 Premium E SI2 60.2 62 345 3.79 3.75 2.27
2 0.2 Premium E VS2 59.8 62 367 3.79 3.77 2.26
3 0.2 Premium E VS2 59 60 367 3.81 3.78 2.24
Sort by carat in descending order
diamonds %>%
arrange(-carat) %>%
head(3)
# A tibble: 3 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 5.01 Fair J I1 65.5 59 18018 10.7 10.5 6.98
2 4.5 Fair J I1 65.8 58 18531 10.2 10.2 6.72
3 4.13 Fair H I1 64.8 61 17329 10 9.85 6.43
Sort by price (descending) within carat (ascending)
diamonds %>%
arrange(carat, -price)
# A tibble: 53,940 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.2 Premium E VS2 59.8 62 367 3.79 3.77 2.26
2 0.2 Premium E VS2 59 60 367 3.81 3.78 2.24
3 0.2 Premium E VS2 61.1 59 367 3.81 3.78 2.32
4 0.2 Premium E VS2 59.7 62 367 3.84 3.8 2.28
5 0.2 Ideal E VS2 59.7 55 367 3.86 3.84 2.3
6 0.2 Premium F VS2 62.6 59 367 3.73 3.71 2.33
7 0.2 Ideal D VS2 61.5 57 367 3.81 3.77 2.33
8 0.2 Very Good E VS2 63.4 59 367 3.74 3.71 2.36
9 0.2 Ideal E VS2 62.2 57 367 3.76 3.73 2.33
10 0.2 Premium D VS2 62.3 60 367 3.73 3.68 2.31
# … with 53,930 more rows
Exercise 9
- Sort the diamonds dataset by ascending price.
- Show only the first five rows of the results.
Your answer should look like this:
# A tibble: 5 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
Exercise 10
How big is the largest diamond in the diamonds dataset? ____ carats.
What was the cut of the three largest diamonds in that dataset? ____.
Combining filter() with arrange()
TODO: replace with video
Video summary:
- Pipelines often combine filter() and arrange() to answer specific questions.
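As a sketch of the same idea on a different dataset, this pipeline answers the question "which 4-cylinder cars in mtcars have the best fuel economy?". It assumes dplyr is loaded; the variable name best_small_cars is only for illustration.

```r
library(dplyr)

best_small_cars <- mtcars %>%
  filter(cyl == 4) %>%   # keep only 4-cylinder cars
  arrange(-mpg) %>%      # then sort from highest to lowest mpg
  head(3)                # then keep the top three rows

best_small_cars
```

Reading each %>% as "then" gives the plan in words: filter, then arrange, then take the first few rows.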
# show the years in which Kenya had the lowest life expectancy
gapminder::gapminder %>%
filter(country == "Kenya") %>%
arrange(lifeExp) %>%
head(6)
# A tibble: 6 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Kenya Africa 1952 42.3 6464046 854.
2 Kenya Africa 1957 44.7 7454779 944.
3 Kenya Africa 1962 47.9 8678557 897.
4 Kenya Africa 1967 50.7 10191512 1057.
5 Kenya Africa 2002 51.0 31386842 1288.
6 Kenya Africa 1972 53.6 12044785 1222.
# show the years in which Kenya had the highest life expectancy (just add '-')
gapminder::gapminder %>%
filter(country == "Kenya") %>%
arrange(-lifeExp) %>%
head(6)
# A tibble: 6 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Kenya Africa 1987 59.3 21198082 1362.
2 Kenya Africa 1992 59.3 25020539 1342.
3 Kenya Africa 1982 58.8 17661452 1348.
4 Kenya Africa 1977 56.2 14500404 1268.
5 Kenya Africa 1997 54.4 28263827 1360.
6 Kenya Africa 2007 54.1 35610177 1463.
In which year did Kenyans have the lowest life expectancy?
gapminder::gapminder %>%
filter(country == "Kenya") %>%
arrange(lifeExp) %>%
head(6)
# A tibble: 6 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Kenya Africa 1952 42.3 6464046 854.
2 Kenya Africa 1957 44.7 7454779 944.
3 Kenya Africa 1962 47.9 8678557 897.
4 Kenya Africa 1967 50.7 10191512 1057.
5 Kenya Africa 2002 51.0 31386842 1288.
6 Kenya Africa 1972 53.6 12044785 1222.
In which year was it highest? All that changes is the minus sign (reverse sorting).
gapminder::gapminder %>%
filter(country == "Kenya") %>%
arrange(-lifeExp) %>%
head(6)
# A tibble: 6 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Kenya Africa 1987 59.3 21198082 1362.
2 Kenya Africa 1992 59.3 25020539 1342.
3 Kenya Africa 1982 58.8 17661452 1348.
4 Kenya Africa 1977 56.2 14500404 1268.
5 Kenya Africa 1997 54.4 28263827 1360.
6 Kenya Africa 2007 54.1 35610177 1463.
Exercise 11
- Sort Asian countries in the gapminder dataset in ascending population order.
- Show only the first 10 rows.
Your results should look like this:
# A tibble: 10 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Bahrain Asia 1952 50.9 120447 9867.
2 Bahrain Asia 1957 53.8 138655 11636.
3 Kuwait Asia 1952 55.6 160000 108382.
4 Bahrain Asia 1962 56.9 171863 12753.
5 Bahrain Asia 1967 59.9 202182 14805.
6 Kuwait Asia 1957 58.0 212846 113523.
7 Bahrain Asia 1972 63.3 230800 18269.
8 Bahrain Asia 1977 65.6 297410 19340.
9 Kuwait Asia 1962 60.5 358266 95458.
10 Bahrain Asia 1982 69.1 377967 19211.
Exercise 12
Use filter(), arrange() and head() to answer the following questions:
The European country with the highest life expectancy is ____.
The European country with the fifth largest population was recorded in which year? ____.
Summarising data using summarise()
TODO: replace with video
Video summary:
- Often you have lots of data and need to make summaries of it, e.g. to calculate the average of a column.
- The summarise() function takes many rows and uses a function to convert those into fewer rows.
- Common summary functions are those which calculate descriptive statistics, like mean(), median(), and sd(), which is short for standard deviation.
- The output of summarise() is a data.frame which can be stored in a variable.
# use function mean() to summarise mpg column
# the result is a data.frame with a single column named 'mean_mpg'
mtcars %>%
summarise(mean_mpg = mean(mpg))
mean_mpg
1 20.09062
# if you omit the column name, the summarised column is named after the summary function used to create it
# this produces column names which can be awkward to process later in a pipeline
mtcars %>%
summarise(mean(mpg))
mean(mpg)
1 20.09062
# median mpg
mtcars %>%
summarise(median_mpg = median(mpg))
median_mpg
1 19.2
# standard deviation
mtcars %>%
summarise(sd_mpg = sd(mpg))
sd_mpg
1 6.026948
# summarise two columns at once (functions are separated with a comma)
# store the resulting data.frame in a variable
mtcars_summary <- mtcars %>%
summarise(M = mean(mpg), SD = sd(mpg))
We can name the summary column when we create it:
mtcars %>% summarise(average_mpg = mean(mpg))
average_mpg
1 20.09062
We can also leave the name out:
mtcars %>% summarise(mean(mpg))
mean(mpg)
1 20.09062
The first version is better, because the column gets a name that is easy to use later in a pipeline.
The same pattern works with other summary functions, such as median() and sd().
This is where we first encounter the need to give things names that R accepts. Avoid spaces and special characters; we recommend lower case with underscores between words.
We can also calculate two new columns at once:
mtcars %>% summarise(M = mean(mpg), SD = sd(mpg))
Note the comma between the two calculations.
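To see the difference that naming makes, this sketch summarises the hp (horsepower) column of mtcars with and without names and then inspects the column names of each result. It assumes dplyr is loaded; the variable names are only for illustration.

```r
library(dplyr)

# named summary columns
named_summary <- mtcars %>%
  summarise(mean_hp = mean(hp), sd_hp = sd(hp))

# unnamed summary column
unnamed_summary <- mtcars %>%
  summarise(mean(hp))

# compare the column names of the two results
names(named_summary)
names(unnamed_summary)

# a named column is easy to pick out later
named_summary$mean_hp
```

The unnamed column ends up being called mean(hp), which contains brackets and is awkward to refer to in later steps.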
Exercise 13
- Copy the code above into your workbook.
- Amend the code to calculate the median weight.
- Note that mtcars stores weight in units of 1000 lbs.
The median car weight is ____ pounds.
Exercise 14
- Amend the code to calculate the mean and standard deviation of weight.
To the nearest pound, the mean car weight is ____ pounds, and the standard deviation is ____ pounds.
Using filter() and summarise() together
TODO: replace with video
Video summary:
- We often want to filter a dataset before summarising.
- We can do this by creating a pipeline with filter() and summarise().
# calculate the mean for cars with manual transmission (am == 1)
mtcars %>%
filter(am == 1) %>%
summarise(mean_mpg = mean(mpg))
mean_mpg
1 24.39231
Calculate the mean for cars with manual transmission (am == 1)
mtcars %>%
filter(am == 1) %>%
summarise(mean_mpg = mean(mpg))
mean_mpg
1 24.39231
Exercise 15
Use filter() and summarise() to calculate the standard deviation of miles per gallon for cars with automatic transmission.
Cars with automatic transmission have a standard deviation (to two decimal places) of ____ miles per gallon.
Grouping data with group_by()
TODO: replace with video
Video summary:
- Our data may have categorical or ‘grouping’ variables (e.g. gender, or country).
- We often want to create summaries for each group.
- We could use filter() and summarise() once for each group, but the group_by() function does this for all groups.
- Adding group_by() to a pipeline runs the subsequent steps once for each group.
group_by()to a pipeline runs the subsequent steps once for each group. - Be careful only to group by categorical variables.
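As a sketch of why group_by() saves effort, the chunk below calculates the mean mpg for manual cars using filter(), then calculates it for both transmission types at once using group_by(). It assumes dplyr is loaded; the variable names are only for illustration. The am column is stored as the numbers 0 and 1, but it only has two possible values, so it can be treated as a categorical variable.

```r
library(dplyr)

# one group at a time: manual cars only (am == 1)
manual_only <- mtcars %>%
  filter(am == 1) %>%
  summarise(mean_mpg = mean(mpg))

# all groups at once: one row per value of am
by_transmission <- mtcars %>%
  group_by(am) %>%
  summarise(mean_mpg = mean(mpg))

manual_only
by_transmission
```

The row for am == 1 in the grouped table should match the result from the filter() version.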
# boxplot of CO2 uptake grouped by grass type
CO2 %>%
ggplot(aes(Type, uptake)) +
geom_boxplot()
# table of CO2 uptake grouped by grass type
CO2 %>%
group_by(Type) %>%
summarise(average_uptake = mean(uptake))
# A tibble: 2 x 2
Type average_uptake
* <fct> <dbl>
1 Quebec 33.5
2 Mississippi 20.9
# group by two factors at once: grass type and experimental treatment
CO2 %>%
group_by(Type, Treatment) %>%
summarise(mean(uptake))
# A tibble: 4 x 3
# Groups: Type [2]
Type Treatment `mean(uptake)`
<fct> <fct> <dbl>
1 Quebec nonchilled 35.3
2 Quebec chilled 31.8
3 Mississippi nonchilled 26.0
4 Mississippi chilled 15.8
In this video we’ll use a dataset about plants rather than cars. Plants photosynthesise by combining sunlight with carbon dioxide to make sugars. The CO2 dataset records carbon dioxide uptake for grass plants from two places of origin, Quebec and Mississippi. The origin is stored in the Type column, which is a factor.
We might make a plot like this:
CO2 %>%
ggplot(aes(Type, uptake)) +
geom_boxplot()
But what if we want these numbers in a table (for example, to include in a report)? We can do that using group_by() and summarise():
CO2 %>%
group_by(Type) %>%
summarise(average_uptake = mean(uptake))
# A tibble: 2 x 2
Type average_uptake
* <fct> <dbl>
1 Quebec 33.5
2 Mississippi 20.9
Another factor in this dataset is an experimental treatment: whether the grasses were chilled or nonchilled. We can also group by two factors at once and get a row for each combination:
CO2 %>%
group_by(Type, Treatment) %>%
summarise(mean(uptake))
# A tibble: 4 x 3
# Groups: Type [2]
Type Treatment `mean(uptake)`
<fct> <fct> <dbl>
1 Quebec nonchilled 35.3
2 Quebec chilled 31.8
3 Mississippi nonchilled 26.0
4 Mississippi chilled 15.8
Exercise 16
chickwts contains data for the weights of chicks (in grams) fed on different diets.
glimpse(chickwts)
Rows: 71
Columns: 2
$ weight <dbl> 179, 160, 136, 227, 217, 168, 108, 124, 143, 140, 309, 229, 181…
$ feed <fct> horsebean, horsebean, horsebean, horsebean, horsebean, horsebea…
Calculate the mean and standard deviation of chick weights for each type of feed.
The mean weight of chicks fed on linseed was (to 2 decimal places) ____ g.
The standard deviation of the weights of chicks fed on sunflower was (to 2 decimal places) ____ g.
Check your knowledge
Write an answer to each of these questions in the Check your knowledge section of your workbook. The answers will be revealed in Session 3.
- What is the <- symbol called and what does it do?
- What is the %>% symbol called and what does it do?
- Which function is used to load data and what are the two places that data can be loaded from?
- How would you select rows from a data.frame where values in a numeric column are between 10 and 20?
- How would you sort a data.frame by a numeric column in descending order?
- How would you select rows from a column which match a word, and then sort those rows in ascending order by a numeric column?
- Which functions would you use to calculate the mean of a numeric column?
- Which other two functions are commonly used to calculate descriptive statistics?
- How could you calculate the mean for one level of a factor?
- How would you calculate the mean for all levels of a factor?
Extension exercises
Extension exercise 1
The country with the highest life expectancy in 1952 was ____. (Hint: use arrange(), filter() and head() with the gapminder dataset.)
Extension exercise 2
The continent with the highest mean per capita GDP in 1987 was ____. (Hint: use filter(), arrange(), group_by(), and summarise().)
Extension exercise 3
Make a boxplot showing life expectancy by continent for years greater than 1999. (Hint: use filter(), ggplot() and geom_boxplot().)
The plot should look like this:
Extension exercise 4
The last two exercises are like the end of level “boss characters” in a computer game. To beat them, you need to select and combine skills you’ve learnt so far.
Before the year 2000, the African country with the largest population and the lowest life expectancy was ____.
Extension exercise 5
Make a table which shows the average life expectancy for each continent, sorted from highest to lowest. It should look like this:
# A tibble: 5 x 2
continent life_expectancy
<fct> <dbl>
1 Oceania 74.3
2 Europe 71.9
3 Americas 64.7
4 Asia 60.1
5 Africa 48.9
Broken script to fix
Start a NEW R session and make this code work:
liibrary(todyverse)
# make a density plot of life expectancy with different color lines for each continent
gapminder %>%
ggplote(aes("lifeExp", colr = "Continent")) geom_density()
# select only years after 1990
gapminder %>%
filter(year > 1990)
ggplot(aes(year, lifeExp, color=continent)) +
geom_jitter()
NOTE - we will know all the errors they will see so can provide hints for each of them
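# correct version
library(tidyverse)
# the gapminder data comes from its own package, which a new session has not loaded
library(gapminder)
# make a density plot of life expectancy with different color lines for each continent
gapminder %>%
ggplot(aes(lifeExp, color = continent)) +
geom_density()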
# select only years after 1990
gapminder::gapminder %>%
filter(year > 1990) %>%
ggplot(aes(year, lifeExp, color=continent)) +
geom_jitter()